1. Introduction

Our group was initially interested in exploring Public Health-related datasets due to our desire of exploring geospatial details of disease and how that affects different geographies. COVID-19 was a huge influence to us wanting to experiment with another form of health concern. This specific dataset includes cases of the Dengue Fever in the city of Merida, from 2012 to 2015, and outlines the coordinates of where every case specifically took place. Through our analysis, we plan on visualizing how the Dengue Fever has spread across Merida over a four-year period, and how it can help inform how diseases typically spread in urban environments.

2. Setup and Data Loading

This section’s primary goal is to load our dataset, Merida_Den12_13_14_15.csv, into the R environment. To accomplish this, and to equip our notebook for the complex analysis that follows, we first load a complete suite of R packages. We load the tidyverse package, which provides core tools like dplyr for data transformation and ggplot2 for static visualization. For our more advanced interactive plots, we load plotly (used for the animated density map) and leaflet (used for the interactive Year-over-Year map). We also include sf to read and handle the simple features (geospatial) data for Merida. Once all libraries are loaded, we use the read.csv() function to import the dataset into a data frame and verify its contents and structure.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(leaflet) 
## Warning: package 'leaflet' was built under R version 4.3.2
library(sf) 
## Warning: package 'sf' was built under R version 4.3.3
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(htmltools)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
data <- read.csv('Merida_Den12_13_14_15.csv')

head(data)
##          X       Y SP_ID        CVEGEO Den2012 Den2013 Den2014 Den2015
## 1 231639.6 2317510     0 3104100010070      17       6       2       2
## 2 231323.1 2316622     1 3104100010085      35      15      10      18
## 3 230769.7 2316113     2 310410001009A      35      14       8      17
## 4 233250.8 2316208     3 3104100010136      40       8       6      10
## 5 233266.3 2317442     4 3104100010140      12       8       4       4
## 6 232736.6 2317238     5 3104100010155      30       8       1       5

3. Data Cleaning and Transformation

data_cleaned <- data %>%
  select(X, Y, starts_with("Den"))
na_counts <- colSums(is.na(data_cleaned))
print(na_counts)
##       X       Y Den2012 Den2013 Den2014 Den2015 
##       0       0       0       0       0       0
head(data_cleaned)
##          X       Y Den2012 Den2013 Den2014 Den2015
## 1 231639.6 2317510      17       6       2       2
## 2 231323.1 2316622      35      15      10      18
## 3 230769.7 2316113      35      14       8      17
## 4 233250.8 2316208      40       8       6      10
## 5 233266.3 2317442      12       8       4       4
## 6 232736.6 2317238      30       8       1       5

Data Cleaning is a crucial step to ensure the integrity and relevance of the data prior to our analysis. IWe want to ensure that all the data required for this project is clean, validated, and standardized so that our group can create data visualizations using the shared data.

The raw dataset contains several columns, including SP_ID and GEOID, which serve as identifiers for position and case record. After discussing the goals for the project, we determined that our exploratory analysis is focused exclusively on the relationship between spatial coordinates and case counts. Therefore the important features for this analysis are the X and Y coordinates and the annual case count columns (Den2012 through Den2015). The SP_ID and GEOID columns, while potentially useful for future work (e.g., joining with demographic or census data), are not required for our current visualization and modeling objectives. Therefore, we dropped the project’s irrelevant columns creating a standard data frame named data_cleaned.

The second important step is to check for missing data. Missing data can introduce significant bias, propagate errors in calculations, and cause visualization failures. The R console output confirms that all columns returned a count of “0”. This is an important finding, as it validates the completeness of our dataset. We can therefore proceed with our analysis without the need for data imputation or row-wise deletion, ensuring that our results are based on the full set of observations.

5. Exploratory Analysis 2: Static Geospatial Patterns - can rename

#maybe overlay the map of the location
# This plot shows the geospatial location of Dengue incidence in Merida, Mexico. We can potentially aesthetically improve this plot later using QGIS or ArcGIS.
plot(data$X, data$Y, 
     main = "Spatial Locations of Sampling Points",
     xlab = "X Coordinate", 
     ylab = "Y Coordinate", 
     pch = 19, col = "blue")

library(ggplot2)
# 1️ Read file
library(sf)

town_map <- st_read("Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp")
## Reading layer `Merida_Den12_13_14_15' from data source 
##   `/Users/Tom/Desktop/dengueAnalysisTechWriting-main/Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 540 features and 6 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 212287.2 ymin: 2309495 xmax: 236386 ymax: 2333181
## Projected CRS: Mexico ITRF2008 / UTM zone 16N
st_crs(town_map)
## Coordinate Reference System:
##   User input: Mexico ITRF2008 / UTM zone 16N 
##   wkt:
## PROJCRS["Mexico ITRF2008 / UTM zone 16N",
##     BASEGEOGCRS["Mexico ITRF2008",
##         DATUM["Mexico ITRF2008",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",6365]],
##     CONVERSION["UTM zone 16N",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",0,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",-87,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.9996,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",500000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["Engineering survey, topographic mapping."],
##         AREA["Mexico east of 90°W, onshore and offshore."],
##         BBOX[17.81,-90,25.77,-84.64]],
##     ID["EPSG",6371]]
disease_data <- read.csv('Merida_Den12_13_14_15.csv')

# make sure CRS is same, convert to sf
disease_sf <- st_as_sf(disease_data, coords = c("X", "Y"), crs = st_crs(town_map))
names(disease_data)
## [1] "X"       "Y"       "SP_ID"   "CVEGEO"  "Den2012" "Den2013" "Den2014"
## [8] "Den2015"
# 4 generate plot
ggplot() +
  geom_sf(data = town_map, fill = "gray95", color = "black", linewidth = 0.3) +
  geom_sf(data = disease_sf, aes(color = Den2012, size = Den2012)) +
  scale_color_viridis_c(option = "plasma") +
  theme_minimal() +
  labs(title = "Spatial Locations",
       color = "Incidence",
       size = "Incidence")

6.5 combining 5 and 6 into 1 section

# 1. Convert data to sf and transform to WGS84 for Leaflet
# We use EPSG:6371 as identified in the original Section 7
data_sf <- st_as_sf(data, coords = c("X", "Y"), crs = 6371)
data_sf_wgs84 <- st_transform(data_sf, crs = 4326)

# 2. Load and transform the neighborhood shapefile
town_map_raw <- st_read("Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp")
## Reading layer `Merida_Den12_13_14_15' from data source 
##   `/Users/Tom/Desktop/dengueAnalysisTechWriting-main/Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 540 features and 6 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 212287.2 ymin: 2309495 xmax: 236386 ymax: 2333181
## Projected CRS: Mexico ITRF2008 / UTM zone 16N
town_map_wgs84 <- st_transform(town_map_raw, crs = 4326)

# 3. Create a sequential color palette for absolute cases
# We find the global max to keep the scale consistent across years
all_cases <- c(data_sf_wgs84$Den2012, data_sf_wgs84$Den2013, data_sf_wgs84$Den2014, data_sf_wgs84$Den2015)
global_max_cases <- max(all_cases, na.rm = TRUE)
pal_cases <- colorNumeric(
  palette = "Reds",
  domain = c(0, global_max_cases),
  na.color = "#a0a0a0"
)

# 4. Create popups for each year
popup_2012 <- ~lapply(sprintf("<strong>ID: %s</strong><br/>2012 Cases: %d", SP_ID, Den2012), HTML)
popup_2013 <- ~lapply(sprintf("<strong>ID: %s</strong><br/>2013 Cases: %d", SP_ID, Den2013), HTML)
popup_2014 <- ~lapply(sprintf("<strong>ID: %s</strong><br/>2014 Cases: %d", SP_ID, Den2014), HTML)
popup_2015 <- ~lapply(sprintf("<strong>ID: %s</strong><br/>2015 Cases: %d", SP_ID, Den2015), HTML)

# 5. Create the interactive leaflet map
m_cases <- leaflet(data_sf_wgs84) %>%
  # Add base map layers
  addProviderTiles(providers$CartoDB.Positron, group = "Light Map") %>%
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") %>%
  
  # Add Neighborhoods layer
  addPolygons(
    data = town_map_wgs84,
    group = "Neighborhoods",
    fillOpacity = 0.1,
    color = "#444",
    weight = 1,
    opacity = 0.7,
    popup = ~SP_ID
  ) %>%
  
  # Set the map to focus on the data
  fitBounds(
    ~as.numeric(st_bbox(geometry)[1]),
    ~as.numeric(st_bbox(geometry)[2]),
    ~as.numeric(st_bbox(geometry)[3]),
    ~as.numeric(st_bbox(geometry)[4])
  ) %>%

  # ==== Layer 1: 2012 Cases ====
  addCircleMarkers(
    group = "2012 Cases",
    color = ~pal_cases(Den2012),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2012,
    label = ~as.character(Den2012)
  ) %>%
  
  # ==== Layer 2: 2013 Cases ====
  addCircleMarkers(
    group = "2013 Cases",
    color = ~pal_cases(Den2013),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2013,
    label = ~as.character(Den2013)
  ) %>%
  
  # ==== Layer 3: 2014 Cases ====
  addCircleMarkers(
    group = "2014 Cases",
    color = ~pal_cases(Den2014),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2014,
    label = ~as.character(Den2014)
  ) %>%
  
  # ==== Layer 4: 2015 Cases ====
  addCircleMarkers(
    group = "2015 Cases",
    color = ~pal_cases(Den2015),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2015,
    label = ~as.character(Den2015)
  ) %>%
  
  # ==== Legend ====
  addLegend(
    pal = pal_cases,
    values = c(0, global_max_cases),
    title = "Case Count",
    position = "bottomright",
    opacity = 1
  ) %>%

  # ==== Layer Control ====
  addLayersControl(
    baseGroups = c("2012 Cases", "2013 Cases", "2014 Cases", "2015 Cases"),
    overlayGroups = c("Light Map", "Satellite", "Neighborhoods"),
    options = layersControlOptions(collapsed = FALSE)
  )

# Display the map
m_cases

7. Exploratory Analysis 4: Analyzing Year-over-Year (YoY) Change - can rename

# 1. Calculate Year-over-Year (YoY) percentage change
# We must handle cases where the denominator (previous year) is 0.
calculate_yoy <- function(current, previous) {
  case_when(
    previous == 0 & current == 0 ~ 0,  # 0 to 0 is 0% change
    previous == 0 & current > 0 ~ Inf, # 0 to N is infinite % change
    previous > 0 ~ (current - previous) / previous
  )
}

data_yoy <- data %>%
  mutate(
    YoY_2013 = calculate_yoy(Den2013, Den2012),
    YoY_2014 = calculate_yoy(Den2014, Den2013),
    YoY_2015 = calculate_yoy(Den2015, Den2014)
  )

# 2. Clean up Inf/-Inf/NaN values for visualization. We'll set Inf to NA.
data_yoy <- data_yoy %>%
  mutate(across(starts_with("YoY_"), ~if_else(is.finite(.), ., NA_real_)))

# 3. Convert to a geospatial 'sf' object and transform coordinates
# The .prj file (EPSG:6371) tells us this is Mexico ITRF2008 / UTM zone 16N.
# Leaflet requires standard WGS84 (EPSG:4326) coordinates (lat/lon).
data_sf <- st_as_sf(data_yoy, coords = c("X", "Y"), crs = 6371)
data_sf_wgs84 <- st_transform(data_sf, crs = 4326)

# 4. Define a diverging color palette
# We cap the domain from -1 (-100%) to 2 (200%) to handle outliers
cap_range <- c(-1, 2)
pal_yoy <- colorNumeric(
  palette = "RdYlBu", 
  domain = cap_range, 
  na.color = "#a0a0a0",
  reverse = TRUE # Make red positive (increase), blue negative (decrease)
)

# 5. Create formatted popup and label content
# Popups appear on click
popup_2013 <- ~lapply(sprintf(
  "<strong>Location ID: %s</strong><br/>2012 Cases: %d<br/>2013 Cases: %d<br/>YoY Change: %s",
  SP_ID, Den2012, Den2013, scales::percent(YoY_2013, accuracy = 0.1)
), HTML)

popup_2014 <- ~lapply(sprintf(
  "<strong>Location ID: %s</strong><br/>2013 Cases: %d<br/>2014 Cases: %d<br/>YoY Change: %s",
  SP_ID, Den2013, Den2014, scales::percent(YoY_2014, accuracy = 0.1)
), HTML)

popup_2015 <- ~lapply(sprintf(
  "<strong>Location ID: %s</strong><br/>2014 Cases: %d<br/>2015 Cases: %d<br/>YoY Change: %s",
  SP_ID, Den2014, Den2015, scales::percent(YoY_2015, accuracy = 0.1)
), HTML)

# Labels appear on hover
label_2013 <- ~scales::percent(YoY_2013, accuracy = 0.1)
label_2014 <- ~scales::percent(YoY_2014, accuracy = 0.1)
label_2015 <- ~scales::percent(YoY_2015, accuracy = 0.1)
# Load and transform the neighborhood shapefile
town_map_raw <- st_read("Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp")
## Reading layer `Merida_Den12_13_14_15' from data source 
##   `/Users/Tom/Desktop/dengueAnalysisTechWriting-main/Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 540 features and 6 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 212287.2 ymin: 2309495 xmax: 236386 ymax: 2333181
## Projected CRS: Mexico ITRF2008 / UTM zone 16N
town_map_wgs84 <- st_transform(town_map_raw, crs = 4326)

# Create the interactive leaflet map
m <- leaflet(data_sf_wgs84) %>%
  # Add base map layers
  addProviderTiles(providers$CartoDB.Positron, group = "Light Map") %>%
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") %>%
  
  # Add Neighborhoods layer
  addPolygons(
    data = town_map_wgs84,
    group = "Neighborhoods",
    fillOpacity = 0.1,
    color = "#444",
    weight = 1,
    opacity = 0.7,
    popup = ~~SP_ID
  ) %>%
  
  # Set the map to focus on the data
  fitBounds(
    ~as.numeric(st_bbox(geometry)[1]),
    ~as.numeric(st_bbox(geometry)[2]),
    ~as.numeric(st_bbox(geometry)[3]),
    ~as.numeric(st_bbox(geometry)[4])
  ) %>%

  # ==== Layer 1: 2013 vs 2012 ====
  addCircleMarkers(
    group = "2013 vs 2012",
    color = ~pal_yoy(YoY_2013),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2013,
    label = label_2013
  ) %>%
  
  # ==== Layer 2: 2014 vs 2013 ====
  addCircleMarkers(
    group = "2014 vs 2013",
    color = ~pal_yoy(YoY_2014),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2014,
    label = label_2014
  ) %>%

  # ==== Layer 3: 2015 vs 2014 ====
  addCircleMarkers(
    group = "2015 vs 2014",
    color = ~pal_yoy(YoY_2015),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2015,
    label = label_2015
  ) %>%
  
  # ==== Legend ====
  addLegend(
    pal = pal_yoy,
    values = cap_range,
    title = "YoY % Change",
    position = "bottomright",
    labFormat = labelFormat(suffix = "%", transform = function(x) x * 100),
    opacity = 1
  ) %>%

  # ==== Layer Control ====
  # *** UPDATED: Added "Neighborhoods" to overlayGroups ***
  addLayersControl(
    baseGroups = c("2013 vs 2012", "2014 vs 2013", "2015 vs 2014"),
    overlayGroups = c("Light Map", "Satellite", "Neighborhoods"),
    options = layersControlOptions(collapsed = FALSE)
  )
## Warning in pal_yoy(YoY_2013): Some values were outside the color scale and will
## be treated as NA

## Warning in pal_yoy(YoY_2013): Some values were outside the color scale and will
## be treated as NA
## Warning in pal_yoy(YoY_2014): Some values were outside the color scale and will
## be treated as NA

## Warning in pal_yoy(YoY_2014): Some values were outside the color scale and will
## be treated as NA
## Warning in pal_yoy(YoY_2015): Some values were outside the color scale and will
## be treated as NA

## Warning in pal_yoy(YoY_2015): Some values were outside the color scale and will
## be treated as NA
# Display the map
m

8. Conclusion and Future Questions